MICROPROCESSOR INSTRUCTION EXECUTION METHOD FOR
EXPLOITING PARALLELISM

TECHNICAL FIELD 

The present invention is in the field of digital computing systems. In particular, it relates to a
method of efficient instruction execution in a microprocessor system.

BACKGROUND ART 

Very Long Instruction Word (VLIW) microprocessor architectures are able to perform a large
number of parallel operations on each clock cycle. However, the characteristic of most
non-numerical code is that there are a large number of potential dependencies between
instructions. That is, one instruction is reliant upon the results of a previous instruction and so
cannot be executed concurrently with it. This means that the instruction stream often
becomes sparse with many functional units unused during many cycles.

A significant contributor to this restriction is the memory alias problem. In languages such as
C or C++ there is heavy usage of pointer memory accesses. It is extremely difficult, and often
impossible, to trace data flow within a program at compilation time to determine the set of
objects that a particular pointer might access at any particular time. This imposes severe
restrictions on performing load and store operations out-of-order. Whenever a store operation
is performed via a pointer it could potentially write to any address. Thus subsequent loads
cannot be moved earlier than the store in case they are "aliased" with the store. This severely
restricts parallelism since, in most cases, the memory accesses are not actually aliased.

Some high end processors have hardware blocks that analyze the addresses for stores as they
are calculated during program execution. They can then be compared against subsequent
loads. This allows greater parallelism as loads can be issued earlier than the store. If there is an
address match then the hardware takes corrective action, such as re-executing the load after
the store is complete. However, such processors are extremely complex and are not suitable
for lower cost embedded applications. Some architecture/compiler combinations can generate
code to statically issue loads before potentially aliased stores. Additional code is then generated
to later compare the addresses and branch to special compensation code to preserve correct
program semantics in the unlikely event that the accesses are indeed aliased. Unfortunately this
adds significant code size overhead and can only be used in limited cases.

A further constraint on parallelism is the number of branches that occur within code. In
non-numeric applications, a conditional branch operation is generally performed every few
instructions. A branch causes divergence of the possible instruction streams so that different
operations are performed depending on the condition. This also restricts the number of
parallel operations on a VLIW processor. Branches also cause problems with the operation of
the pipelines used in processors. These pipelines fetch instructions several clock cycles before
the instructions are actually executed. If that is dependent on some condition that is only
calculated just before the branch then it is difficult to avoid a pipeline stall. During a stall the
processor performs no useful work for several cycles until the correct instruction is fetched
and works its way down the pipeline.

Most high-end processors include some form of branch prediction scheme. There are many
levels of solution complexity, but all try to guess which way a particular branch will go on the
basis of compiler analysis and the history of which way the branch has gone in the past. Many
of these processors can then speculatively execute code on the assumption the branch will go
a particular way. The results from this speculative execution can then be undone (or "squashed")
should the assumption prove to be incorrect. Some processors have a predicated execution
mechanism. This allows some branches to be simplified by eliminating the branch and
executing its target code conditionally. However, this technique can generally only be applied
to a limited set of branches in code.

SUMMARY OF THE INVENTION

The preferred embodiment performs all execution within code blocks called regions. A region
is simply a fixed sequence of execution words. Each individual word may encode multiple
individual operations. No branches are performed during the execution of a region. Instead,
multiple potential branch destinations may be evaluated during the execution of the region.
When the end of a region is reached the code will branch to a successor region. Thus only one
actual branch is executed per region.

A region is constructed from a sequence of contiguous instructions from an original
sequential representation of a program. This sequence may be subdivided into a number of
"strands". Each strand is a subset of instructions from the region. The strands are numbered
in terms of their original order in the sequential program representation. To maintain the
original program semantics the dependencies between operations in different strands must be
observed as though they were being executed in the original strand order.

The preferred embodiment provides much greater scope for allowing the reordering of
operations belonging to different strands. If necessary the dependencies between operations in
different strands can be violated. The architecture contains hardware mechanisms to recover
from such an event and ultimately produce the correct results. Importantly, the hardware
overhead to achieve this is small in comparison to other processor architectures that support
"Out-of-Order" execution.

The code generator can generate a code sequence that mixes operations as appropriate from
different strands. Internal strand dependencies are maintained but if there are unused
functional units in a particular execution word then appropriate operations can be selected
from other strands. In this way each of the functional units can be kept busy for a much
higher percentage of the time and the instances of unused units are significantly reduced.

The execution of each strand is conditional. An operation in one strand can cause a later
strand to be squashed. If a strand is squashed then all the results it generates are nullified, as
though it has not been executed at all. Many control flow constructs can be implemented
using strands. For instance, if a branch conditionally jumps over a block of code then that
code can be formed into a strand. Its execution is made conditional on the inverse of the same
calculation previously used for the conditional branch.

Multiple strands within a region can be used to represent control flow of arbitrary complexity,
including if-then-else constructs and conditional strands within other conditional strands. All
this conditional execution can be performed using the static schedule of operations and
without any complex hardware or branches that would upset the efficiency of the processor
pipeline.
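
By way of illustration only, the folding of a conditionally skipped block into a squashable
strand may be sketched in Python as follows. This is a behavioural model rather than the
disclosed hardware; the function names `run_with_branch` and `run_with_strands` and the
example arithmetic are hypothetical.

```python
# Hypothetical behavioural model: a branch that jumps over a block, compared
# with the same block formed into a strand that may be squashed.

def run_with_branch(x):
    """Sequential form: the branch jumps over the block when x < 0."""
    y = x
    if not (x < 0):
        y = y * 2            # the conditionally executed block
    return y + 1

def run_with_strands(x):
    """Region form: the block is strand 1 and is always issued."""
    enabled = {0: True, 1: True}   # every strand starts enabled
    # Strand 0 evaluates the original branch condition and squashes strand 1
    # when the branch would have jumped over the block.
    if x < 0:
        enabled[1] = False
    # Strand 1 computes into a temporary; the result is kept only if the
    # strand is still enabled, otherwise it is nullified.
    doubled = x * 2
    y = doubled if enabled[1] else x
    return y + 1
```

Both forms are intended to return the same value for every input; the second simply
replaces the control-flow decision with a decision about whether to keep a result.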

To improve operation parallelism a load from a later strand can be executed before a store
from an earlier (i.e. lower numbered) strand. Thus calculations for the two strands can
proceed in parallel without serialization enforced by any potential address alias. The code
includes a special check operation. This is a simple address comparison operation between the
store and the load address. If they are not the same then execution of the two strands can
continue. If they are identical then the strand containing the later load is aborted. The region
may then be re-executed and the earlier load is able to correctly load the value generated by
the store issued later in the region. This store is temporally earlier as it is from a lower
numbered strand.
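
The check-and-re-execute behaviour described above may be sketched as follows. This is a
simplified Python model under stated assumptions: memory is a plain list, strand 0 contains
a single store, strand 1 contains a single load that has been hoisted above that store, and
the name `run_region` is hypothetical.

```python
def run_region(memory, store_addr, store_val, load_addr):
    """Model of one region. Returns (value seen by the load, passes needed)."""
    strand0_done = False     # strand 0 is blocked once it has completed
    passes = 0
    while True:
        passes += 1
        loaded = memory[load_addr]           # strand 1: load, issued early
        store_in_this_pass = not strand0_done
        if store_in_this_pass:
            memory[store_addr] = store_val   # strand 0: store, issued later
            strand0_done = True
        # Check operation: a simple address comparison. It only signals a
        # hazard when the store actually followed the load in this pass.
        if store_in_this_pass and store_addr == load_addr:
            continue                         # abort strand 1, re-execute
        return loaded, passes
```

With distinct addresses the region completes in one pass. With identical addresses the
first pass is aborted and the second pass loads the value written by the store.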

BRIEF DESCRIPTION OF THE DRAWINGS

Figure 1 shows an illustration of how strand based execution allows streams of operations
from different basic blocks to be executed in parallel.

Figure 2 shows an illustration of how operations from individual basic blocks are extracted
and mapped to different strands in a fixed causal ordering.

Figure 3 illustrates the relationship of execution phases within a number of strands.

Figure 4 illustrates how weak and conditional arcs are represented in the graph in order to 
allow conditional scheduling of alias checking operations. 

Figure 5 shows a representation of the relationship between individual regions and the strands
therein.

Figure 6 indicates how individual squashes executed in one strand can impact the execution of 
another strand.

Figure 7 indicates how a squash operation affects the strand execution status which 
subsequently disables the execution of later operations.

Figure 8 illustrates the contemporaneous nature of execution of individual strands and the logical
view of side entries into a region.

Figure 9 illustrates how additional squashes are inserted in response to a call being performed 
within a code region.

Figure 10 illustrates how the strand mechanism is used to effect a call and return from a
subroutine where the call is not performed from the final strand in a region.

Figure 11 provides a view of the relationships between the operations associated with the load
and a store.

Figure 12 illustrates how a sequence of memory accesses might be mapped into two different
strands with appropriate alias checking operations.

Figure 13 provides an example timeline of how the operations given in Figure 12 might be
executed in the context of a load and store being aliased.

Figure 14 provides an illustration of how a number of strands may be aborted and then re- 
executed in order to avoid the data hazard. 

Figure 15 illustrates how a static guard operation may be used to abort a series of strands and 
force a subsequent re-execution. 

Figure 16 provides an overview of the data plane connections to a functional unit including
input and output mechanisms and potential neighbour connections.

Figure 17 illustrates how bits within the execution word are used to control various aspects of
a functional unit on a cycle-by-cycle basis.

Figure 18 illustrates the distribution mechanism for the strand status that is bussed to various
functional units in the system.

Figure 19 illustrates the internal architecture of the functional units used for checking aliases
between addresses that can cause an abort of a strand.

Figure 20 illustrates the internal architecture of the functional units used to guard execution of
certain groups of strands.

Figure 21 illustrates the internal architecture of the abort handling module within the overall
strand control unit.

Figure 22 illustrates the internal architecture of the squash handling module within the overall
strand control unit.

Figure 23 illustrates how the abort and squash strand states are combined in order to generate
a strand status that may be distributed throughout the processor architecture.

Figure 24 illustrates the construction of a control and data flow graph from input sequential
code and how redundant branches are eliminated as the scope of the code region is
determined.

domination relationships of strands.

DESCRIPTION OF PRESENTLY PREFERRED EMBODIMENT

Overview 

One of the key requirements of the architecture is to support scalable parallelism. The basic
structure of the micro architecture is focused on that goal. Extracting parallelism from highly
numeric loop kernels is relatively straightforward. Such loops have regular computation and
access patterns that are easy to analyse. The nature of the algorithms also tends to lend itself
well to parallel computation. The architecture needs to balance the availability of
computational resources (such as adders, multipliers) and memory units to ensure the right
degree of parallelism can be extracted. Such numeric kernels are common for Digital Signal
Processors (DSPs). The loops tend to lack any complex control flow. Thus DSPs tend to be
highly efficient at regular computation loops but are very poor at handling code with more
complicated control flow.

Other than in numeric computation loops, C and C++ code tends to be filled with
complicated control flow structures. This is simply because most control code is filled with
conditional statements and short loops. Most C++ code is also filled with references to main
memory via pointers. The result is a code stream from which it is extremely difficult to extract
useful amounts of parallelism. In average Reduced Instruction Set Computer (RISC) code,
approximately 30% of all instructions are memory references and a branch is encountered
every 5 instructions.

High end processors have to deal with this kind of code and extract parallelism from it in
order to achieve competitive performance. The complexity of such processors has
mushroomed in recent years to try to deal with this issue. The control logic of such
processors has literally millions of transistors dedicated to the task of extracting parallelism
from the code being executed. The extra hardware needed to actually perform the operations
in parallel is tiny in comparison with the logic required to find and control them. The main
method utilised is to support dynamic out-of-order and speculative execution. This allows the
processor to execute instructions in a different order from that specified by the program. It
can also execute those instructions speculatively, before it is known for sure whether they
should be executed at all. This allows parallelism to be extracted across branches. The difficult
constraint is that the execution must always produce exactly the same answer as would result if
the instructions were executed strictly one by one in the original order.

The control and complexity overheads of dynamic out-of-order execution are far too high for
the application areas of the preferred embodiment. The intended application areas are
embedded systems where there are code sequences which contain significant potential
parallelism and where the repetition of those sequences represents a high percentage of the
execution time of the processor. There is a significant cost overhead due to the area occupied
by the control logic and the cost of designing it. Additionally, such logic is not amenable to the
scalability requirements of the architecture.

A number of recent developments in the area of micro architecture have been focused on
VLIW type architectures. There is a "back to basics" movement that seeks to place the burden
of extracting parallelism on the compiler. The compiler is able to perform much greater
analysis to seek parallelism in the application. It is also considerably simpler to develop than
equivalent control logic. This is because the control logic must find the parallelism as the
program is running so must itself be highly pipelined and suffers from the physical constraints
of circuit design. The compiler performs all of its analysis before the code is executed with the
advantage of much longer analysis time. For most classes of static parallelism, compiler
analysis is very effective.

Unfortunately, software analysis is poor at extracting parallelism that can only be determined
dynamically. Examples of these are branches and potentially aliased memory accesses. A
compiler can know the probability that a particular branch will be taken from profiling
information, but it cannot know for sure whether it will be taken on any particular instance. A
compiler can also tell from profiling that two memory accesses never seem to access the same
memory location, but it cannot prove that will always be the case. Consequently it is not able
to move a store operation over a potentially aliased load operation as that might affect the
results the program would generate. This restricts the amount of parallelism that can be
extracted statically in comparison to that available dynamically.

Strand Execution Model

The preferred embodiment employs a combination of static and dynamic parallelism
extraction. This gives the architecture access to high degrees of parallelism without the
overhead of complex hardware control structures. The architecture itself is fully static in its
execution model. It executes instructions in exactly the order specified by the tools. These
instructions may be out of order with respect to the sequential program, if the tools are able to
prove that the re-ordering does not affect the program result. This reordering is called
instruction scheduling and is an important optimisation pass for most architectures.

This disclosure presents an execution model that also allows it to perform out-of-order
operations that cannot be proved as safe at code generation time. An operation sequence is
considered safe if it is guaranteed to produce the same results as the original sequential code
under all circumstances. In general it only performs these optimisations if it knows that they
will usually be safe at execution time. The hardware is able to detect if the assumptions are
wrong and arrange a re-execution of the code that is guaranteed to produce the correct answer
in all circumstances. The hardware overhead for this "hazard" detection and re-execution is
very small.

Figure 1 illustrates the execution model of the preferred embodiment. The area 103 shows a
representation of a code sequence composed of basic blocks of instructions 101. A basic
block is a segment of code that is delineated by a branch operation. The condition upon
which the branch is performed is generally calculated in the code of the basic block so it is not
possible to know before entering the block which route will be taken. Each execution of the
code will produce a particular path of execution through the basic blocks. Certain paths may
be considerably more likely than others but any route may be taken. The basic blocks are
related by the branches that may be performed between them. On a given execution a
particular path 102 will be taken. The area 104 shows the linear execution of the basic blocks
on a sequential processor. Each basic block has to be executed one after the other, as the
branches are resolved. The area 105 shows a possible execution of the same code on a
processor of the preferred embodiment. Code from different basic blocks may be interleaved
to make more efficient use of execution resources within the processor and to more
effectively hide the latencies of certain operations. The choice of which operations to perform
in parallel is determined at code generation time.

The processor is able to execute code from a number of different basic blocks in parallel. It
does this to increase the amount of parallelism and efficiency of the architecture. It might
know that one basic block is very likely to follow another. It can pull instructions forward
from the second block to execute in parallel with the instructions of the first. This allows
calculations to be started earlier (and thus finish earlier) and to more effectively balance the
resource utilisation of the processor. The scheduling algorithm does this while taking account
of the probability that the execution of a particular instruction will be useful.

Executing code speculatively in this manner normally requires a significant hardware
overhead. If a particular block should not have been executed then any results it has produced
must be discarded. This is referred to as "squashing" the execution. In particular, any store
operations that the code has performed must be undone as they could permanently pollute
the memory space with incorrect results. Mechanisms are employed that allow the benefits of
speculative execution while only requiring the minimum of hardware overhead.

A strand represents a particular sequential group of operations that is being executed on the
machine. Many strands may be executed simultaneously. Each individual operation that is
performed belongs to a particular strand. Whenever an execution word is issued it may
contain operations that are associated with a number of different strands.

The strand mechanism serves two main purposes:

Firstly, it implements a predicate mechanism. This allows conditional blocks of code to be
integrated into a region in which all operations are executed. Each strand has an associated
predicate flag. By default this is set to true but the strand may be "squashed" and this flag set
to false. If a strand is squashed then all operations from the strand are essentially nullified as if
they were never executed. Thus it provides a means to incorporate conditional code into an
atomically executed region. Thus the pipeline stall hazards of conditional branches are
avoided.

Secondly, it provides a framework to implement a low cost speculative execution mechanism.
The code scheduling may make code motions that violate the operation ordering rules for
inter-strand dependencies. The hardware incorporates simple detection mechanisms to
determine when a dependency hazard has actually occurred. The numbering of the strands is
in accordance with the temporal ordering of the operations in the sequential program. This
temporal ordering may be at odds with the physical ordering in a code schedule if some
types of speculative code motions have been performed. Since each operation is associated
with the strand to which it belongs it is possible to determine the correct temporal order of
operations without any additional hardware analysis or complex instruction reorder buffers.
Recovery from a dependency hazard is relatively easily achieved by re-execution of a region.
This is done a number of times with certain strands squashed. This allows the original
temporal order to be reproduced.

Strand Enabled Parallelism

The use of strands provides significantly improved scope for parallel code instruction
execution. Figure 2 illustrates the relationship between strands and basic blocks:

Area 201 shows the individual operations 202 within the basic blocks as well as the control
flow relationships of the basic blocks 203. The last operation of a basic block may be a
branch that determines which basic block will be executed next. Each of the basic blocks is
allocated a different strand number shown in the blocks 202.

Area 205 shows a potential instruction schedule generated for the processor. Operations 202
from different strands (i.e. basic blocks) may be issued during the same clock cycle. The
original strand number is shown in each operation 202. Each of the operations is issued from
a particular group within the execution word 204. The order of operations within a particular
strand is always maintained. The order of operations between strands does not have to be
maintained, allowing much greater scheduling freedom for the architecture. Although the
scheduler is able to perform operations out of order between strands it will only do so if that
is unlikely to lead to a hazard. The hardware is able to recover from a hazard but there is a
performance penalty in doing so.
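
As a minimal sketch of how operations from several strands might be packed into execution
words, the following Python function performs a simple greedy schedule. It is not the
scheduling algorithm of the preferred embodiment. It assumes, for illustration, that each
operation depends on its predecessor in the same strand (so at most one operation per
strand issues per word), that there are no inter-strand dependencies, and that `width` is
the number of operation slots in a word. The name `schedule` is hypothetical.

```python
def schedule(strands, width):
    """strands: list of operation-name lists, indexed by strand number.
    Returns a list of execution words, each a list of (strand, op) pairs."""
    next_op = [0] * len(strands)      # index of next unissued op per strand
    words = []
    while any(next_op[s] < len(ops) for s, ops in enumerate(strands)):
        word = []
        for s, ops in enumerate(strands):      # lower strands take priority
            if len(word) < width and next_op[s] < len(ops):
                word.append((s, ops[next_op[s]]))
                next_op[s] += 1                # intra-strand order is kept
        words.append(word)
    return words

# Three strands of 3, 2 and 1 operations packed into 2-wide words.
words = schedule([["a0", "a1", "a2"], ["b0", "b1"], ["c0"]], width=2)
```

In this example six operations that would take six steps sequentially occupy three
execution words, while the order of operations inside each strand is unchanged.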

This mechanism allows instructions to be issued out of order. However, if the correct results
are to be produced by the architecture then the data flows between strands that would occur if
they were executed in the correct order must be maintained. Area 206 shows that the logical
order of the strands is always the same, e.g. a result generated by an operation in strand 3
should never influence the calculations performed in strand 2. That could never happen if the
instructions were executed in order. In effect the architecture has two different time domains.
There is the physical time domain of the order of instruction execution. There is also the
logical time domain that was present in the sequential program that must be preserved in
order to maintain the sequential program semantics.
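
The relationship between the two time domains can be illustrated with a short Python
sketch. It assumes each issued operation is recorded as a `(strand, name)` pair in physical
issue order; because the order within a strand is always maintained, a stable sort on the
strand number alone recovers the logical order. The function name `logical_order` is
hypothetical.

```python
def logical_order(physical_schedule):
    """Recover the logical (sequential program) order from the physical
    issue order. Python's sorted() is stable, so operations of the same
    strand keep their relative issue order."""
    return sorted(physical_schedule, key=lambda op: op[0])

# Physical order interleaves strands 0, 1 and 2.
physical = [(0, "a0"), (1, "b0"), (0, "a1"), (2, "c0"), (1, "b1")]
logical = logical_order(physical)
```

No reorder buffer is modelled here: the strand number carried by each operation is the
only information needed to reconstruct the sequential view.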

Strand Phases 

Each strand has three distinct phases of execution that occur during the execution of a region.
A strand is initiated in its speculative phase. During this phase operations from the strand may
be executed that are either control and/or data speculative. An operation is control speculative
if it is executed before it is known whether the strand to which it belongs should be executed
at all. An operation is data speculative if it is operating upon data that might not have its
correct value as expected at the start point of the strand in the program. Operations issued in
the speculative phase of the strand may have to be re-executed any number of times. Thus
they may not change the state that may be read by the re-execution, as that would invalidate
the results of the second and subsequent executions. For example, register file writes and
memory stores may not be executed in the speculative phase of a strand.

At some later point each strand enters a provisional phase. At the start of the provisional
phase the squash state of the strand is finalized. In other words, if the strand is going to be
squashed then it will have been so by the start of the provisional phase. The strand may only
be disabled after it has entered the provisional phase by an earlier branch that would also
squash all higher numbered strands. Certain special types of operation may be issued in the
provisional phase, such as squash operations and branches.

At some further later point each strand enters a committed phase. During this phase it is
known that the strand execution state is definitely in its final form and the entry conditions in
terms of register and memory state are correct for the strand's position in the code.
Operations that permanently change the machine state may be executed during the committed
phase.

The division of strands between the phases allows speculative operations to be performed that
can be squashed without permanently changing the machine state. This gives much greater
scheduling freedom and thus potential parallelism. Moreover, the mechanism requires very
little hardware overhead.

Strand Phase Transition 

A strand changes state from its speculative to provisional phase on the first cycle after which
all preceding squash operations would have influenced the execution state of the strand. No
explicit operation is required to indicate the state transition.

A strand transitions from the provisional to committed phase when all operations that can
disable the strand have been executed and have impacted the strand state. Again, there is no
requirement for an explicit operation to indicate that the committed phase has been entered.

An illustration of the transition between the phases is shown in Figure 3. Each of the blocks
301a-e is a separate strand. These strands may be simultaneously active and operations may be
issued from each of the strands in an interleaved manner. Each strand may have up to three
different phases that determine the type of operations that may be issued in the strand. The
order of the phases is fixed and transitions from speculative 302, to provisional 303 and then
to committed 304.
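
The fixed phase ordering and its effect on which operations may be issued can be sketched
in Python as below. The operation kinds and the table `EARLIEST_PHASE` are illustrative
assumptions: the description states only that register file writes and memory stores are
excluded from the speculative phase, that squashes and branches may be issued in the
provisional phase, and that state-changing operations may be executed in the committed
phase. Treating permissions as cumulative across later phases is also an assumption.

```python
from enum import IntEnum

class Phase(IntEnum):
    SPECULATIVE = 0
    PROVISIONAL = 1
    COMMITTED = 2

# Hypothetical table: earliest phase in which each operation kind may issue.
EARLIEST_PHASE = {
    "alu": Phase.SPECULATIVE,
    "load": Phase.SPECULATIVE,
    "squash": Phase.PROVISIONAL,
    "branch": Phase.PROVISIONAL,
    "store": Phase.COMMITTED,
    "regfile_write": Phase.COMMITTED,
}

def may_issue(kind, phase):
    """True if an operation of this kind may issue in the given phase."""
    return phase >= EARLIEST_PHASE[kind]

def advance(phase):
    """Move to the next phase; the order is fixed and never goes backwards."""
    return Phase(min(phase + 1, Phase.COMMITTED))
```

A code generator could use a check of this shape to reject a schedule that places a store
before its strand has reached the committed phase.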

Region Based Execution

A region is a group of consecutive execution words to be executed as an atomic unit.
Execution of a region may only enter with the first execution word and may only exit at the
last word. Thus all words within a region are always executed if the region is entered. This
property considerably simplifies the code generation and scheduling process.

Within a particular region a number of branches may be issued. The branches do not have any
effect until the end of the region is reached. At that point the hardware is able to resolve
between many possible following regions. Thus the hardware handles a multi-way branch.
Collapsing multiple branches into a single decision point allows the architecture to execute
code efficiently without the need for complex branch prediction mechanisms.
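
One possible way to model the deferred multi-way branch is sketched below in Python. The
priority rule, in which the taken branch from the lowest numbered strand wins, is an
assumption made for illustration (it is consistent with an earlier branch squashing all
higher numbered strands, but the resolution logic is not specified in this passage). The
name `resolve_successor` and the region labels are hypothetical.

```python
def resolve_successor(branches, fallthrough):
    """branches: list of (strand, taken, target) recorded during the region.
    Returns the single successor region chosen at the end of the region."""
    taken = [b for b in branches if b[1]]
    if not taken:
        return fallthrough            # no branch taken: default successor
    # Assumed rule: the temporally earliest (lowest strand) taken branch wins.
    return min(taken, key=lambda b: b[0])[2]

issued = [(2, True, "R5"), (1, True, "R3"), (0, False, "R9")]
successor = resolve_successor(issued, fallthrough="R1")
```

However many branches are issued inside the region, only this one decision redirects
instruction fetch.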

The region is a fundamental building block for the architecture, both during the code
generation process and during code execution. Each software function is composed of one or
more regions. A region is the basic unit of execution granularity. A region may only be entered
via its first execution word and all execution words within it will be executed. A number of
branches may be issued and they are resolved so that a single successor region is selected at
the end of the region execution.

All strands are limited to the lifetime of a single region. The architecture is able to execute
operations out of order within a particular region. Out of order execution and any resulting
hazards are resolved at the end of the region and then execution continues on to another
region, which may itself issue operations out of order.

Figure 5 illustrates an example set of regions and the relationships between them. It shows the
execution of the individual strands within each region. There are a number of regions 501
composed of individual execution words 502. Within each region there may be multiple
strands 503, 504 and 505. The number of strands is only limited by the code and the capability
of the hardware. Each region may have a number of possible successor regions 506
depending on the branch that is taken.

If a hazard is detected during execution then the sequential semantics of the strands have not
been properly preserved. The architecture must be able to recover from this situation with as
little overhead as possible.

Upon detecting a hazard in a particular strand the results generated for that and any later (i.e.
higher numbered) strands may be incorrect. The architecture allows execution to continue
until the end of the region, when the strands will be completed. Any results from the hazard
strand, and any higher strands, are discarded. The architecture then re-executes the code from
the start of the region again. Since lower numbered strands have already been successfully
completed they are not executed a second time. The architecture includes logic to block
operations from those strands. Since the lower strands have completed and generated their
results the hazard strand is able to execute correctly, utilizing any required results from the
lower strands. If another, even higher numbered, strand generates a hazard then the region
may be repeated a second time and so on. When all strands have successfully completed the
processor may move onto the successor region.
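
The recovery sequence described above can be modelled, under simplifying assumptions, by
the following Python sketch. Hazards are supplied as a dictionary from pass number to the
strand that detects a hazard on that pass; in the architecture they would come from
detection hardware. The name `execute_region` is hypothetical.

```python
def execute_region(num_strands, hazard_on_pass):
    """Returns, for each pass over the region, the list of strands whose
    operations were allowed to execute on that pass."""
    first_pending = 0        # strands below this have completed and are blocked
    passes = []
    n = 0
    while first_pending < num_strands:
        n += 1
        passes.append(list(range(first_pending, num_strands)))
        hazard = hazard_on_pass.get(n)
        if hazard is None or hazard < first_pending:
            first_pending = num_strands   # every remaining strand completed
        else:
            # Strands below the hazard strand keep their results; the hazard
            # strand and all higher strands are discarded and run again.
            first_pending = hazard
    return passes

history = execute_region(4, {1: 1, 2: 3})
```

Here the first pass runs all four strands, a hazard in strand 1 causes strands 1 to 3 to
be repeated, and a further hazard in strand 3 causes one more pass containing only that
strand.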

The goal of the architecture is to execute all strands successfully on the first attempt. The
compiler does extensive analysis to ensure that the chances of hazards are small. The key is
that the compiler doesn't have to prove that a hazard cannot happen. The re-execution
mechanism ensures correct completion of the strands if required. It does this with a
minimum of hardware overhead. The size of regions is limited to a few tens of instructions so
that the overhead of any re-execution of the region is not too great.

Squash Mechanism 

This section describes the operation of the squash mechanism in more detail. A squash is an
operation that may be issued that impacts the status of other strands. The squash is
conditional (i.e. the squash only occurs if a certain condition evaluates to true). This
mechanism allows strands to be dynamically squashed on the basis of dynamic data and thus
influence the execution of other operations. In effect this allows conditional code that might
otherwise be branched around due to a condition to be folded into a single contiguous stream
of operations.

Figure 6 illustrates the squash mechanism. The box 601 represents a region composed of
individual operations 602. Each of the operations belongs to one of a number of strands 608
within the region. A strand itself is composed of a number of individual operations that
communicate via data flow. In general an operation from any strand can be issued at any point
during the region execution. However, there may be ordering dependencies between strands
that result in operations from higher strands being issued later in the region schedule. The goal
of the scheduling process is to overlap the execution of multiple strands.

A global strand state 606 is shown after each execution cycle of the region. There is a single bit
for each strand that is 1 if it is enabled and 0 if it is disabled. This is initially set to all 1s
to enable all strands. In this example the status for just five strands is shown. The Strand
Execution Mask (SEM) state is distributed to all of the functional units within the architecture
that are responsible for executing operations.

An individual operation may be a squash that influences the execution of another strand. This is
illustrated by a squash 604 of strand 1 and another squash 607 of both strands 3 and 4. A
number of clock cycles after the squash is issued (and its associated condition is true) the
appropriate strands are squashed. This prevents operations from those strands being executed.
On the diagram they are shown as being shaded 603.
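The effect of the two squashes on the five-strand state can be sketched as follows. This is a
minimal model only: the helper name `apply_squash` is hypothetical, the squash takes effect
immediately rather than a number of clock cycles after issue, and both conditions are assumed
to evaluate to true.

```python
NUM_STRANDS = 5  # the example shows the status of just five strands


def apply_squash(sem, strands, condition):
    """Return the enable mask after a conditional squash.

    Bit i of `sem` is 1 while strand i is enabled. The squash only takes
    effect when `condition` is true.
    """
    if not condition:
        return sem
    for s in strands:
        sem &= ~(1 << s)  # clear the enable bit of each squashed strand
    return sem & ((1 << NUM_STRANDS) - 1)


sem = (1 << NUM_STRANDS) - 1           # initially all 1s: every strand enabled
sem = apply_squash(sem, [1], True)     # in the manner of squash 604 of strand 1
sem = apply_squash(sem, [3, 4], True)  # in the manner of squash 607 of strands 3 and 4
print(format(sem, "05b"))              # prints 00101: only strands 0 and 2 remain
```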

There is an ordering to the entry of the committed phase for each of the strands, shown as the
timeline 605.

Instruction Execution

The status of strands is used to influence the execution of individual operations performed by
the architecture. Each operation that is performed belongs to a particular strand. The
operations must be scheduled statically without knowledge of which particular strands will be
enabled or not at run time. Thus the hardware must read the state of strands dynamically to
influence the operation execution. If a strand is disabled then any impact the operation has on
the permanent state of the architecture must be avoided. This is logically equivalent to the
operation not being executed at all. However, this allows a fixed schedule of operations to be
used that are able to dynamically respond to the impact of squashed strands without any need
to perform branches.

Figure 7 illustrates the hardware mechanism that is used to show the relationship between
individual operations and their strand membership. The box 701 shows a representation of a
region and the execution words within it. A new execution word is executed on each clock
cycle. Each execution word is composed of a number of individual operations. Such
operations may be composed of a number of operation bits 702 and a strand tag 704. This
strand tag gives the number of the strand to which the operation belongs. In this example
there are a maximum of 16 strands and thus a 4 bit strand number. The operation bits control
various other aspects of the operation execution.

Every operation that is issued uses its strand tag to compare against the SEM state, which is
distributed globally to all functional units. If the relevant strand is disabled then the execution
of the operation will not have any permanent impact on the state of the architecture. Thus
operations that perform memory or register writes must be suppressed in this manner. Certain
other types of operation, which never affect register or memory state, do not
need to be suppressed and can be executed whether the strand they belong to is enabled or
not. Such operations do not need a strand tag and thus the execution word width can be
reduced.
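The gating of writes by the strand tag can be sketched as below. The representation is an
assumption made for illustration: an execution word is modelled as a list of hypothetical
`(strand_tag, dest, value)` register writes, and `execute_word` is not a name used in the
description.

```python
STRAND_TAG_BITS = 4  # 16 strands in the example, hence a 4 bit strand tag


def execute_word(word, sem, registers):
    """Apply one execution word, suppressing writes from disabled strands.

    `word` is a list of (strand_tag, dest, value) tuples and bit i of `sem`
    is 1 while strand i is enabled.
    """
    for strand_tag, dest, value in word:
        if not 0 <= strand_tag < (1 << STRAND_TAG_BITS):
            raise ValueError("strand tag does not fit in the tag field")
        if (sem >> strand_tag) & 1:
            registers[dest] = value  # strand enabled: the write takes effect
        # Strand disabled: the write is suppressed and no state changes,
        # which is equivalent to the operation not being executed at all.
    return registers
```

For instance, with strand 1 squashed, `execute_word([(0, "r1", 10), (1, "r2", 20)], 0b1111111111111101, {})`
leaves only `r1` written.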

The SEM state is shown as 705 for each cycle of execution. An example squash 703 is shown
being issued in the first clock cycle that squashes strand 0001. This clears the bit in the SEM
706 so that subsequent operations issued to strand 1 are disabled. These disabled operations
are shown as being shaded in the figure.

Code Generation

This section describes the subdivision of the sequential code to be run on the architecture into
individual regions. Each region is itself composed of one or more strands. The efficient
creation of regions and strands from the application code is key to obtaining good
performance from the architecture.

Calling Subroutines 

The subroutine calling mechanism used in the presence of strands is illustrated in Figure 10.
This performs a call from a region 1001 to a called region 1002. The flow of control is shown
as 1003. A call is implemented as a standard branch on the architecture. Before executing the
branch the host link register is loaded with the return address. This is the address of the
instruction following the call. This is normally performed by the program counter being
copied to the link register as a side effect of a call. A branch representing the call 1004 is
executed and causes control to be passed to the called region 1002.

The called region is executed 1005. All strands are enabled. The link register may be preserved
on the stack frame if further calls are to be made. When the end of the called function is
reached (which may or may not be in the region initially called) a return instruction is
encountered 1006. This is effectively an indirect branch using the link register as a destination.
The location loaded is the address of the instruction following the call. This is a reference to
the calling region 1001 with a strand number higher than the strand containing the original
branch call.

The mapping contains both a region address and a first strand number. The return generally
occurs to the same region that originally initiated the call. However, the strand number is set to
one more than the strand that caused the call. Thus the calling strand and any earlier strands are
squashed. In some cases where performance is highly critical the return may be made to a
different strand that does not contain earlier squashed strands. A branch is made back to the
calling region 1007. Finally execution can continue out of that region 1008.

Read Speculation

This section describes the mechanisms used in the preferred embodiment to perform read
speculations. These allow a particular strand to execute with speculative data. If it is later
determined that the data used was incorrect then a combination of software and hardware
mechanisms allow the correct results to be recovered. This static speculation mechanism
utilises the properties of strands to effect this behaviour with a minimum of hardware
overhead. The ability to perform speculative operations is key to obtaining high performance
through parallelism.

Memory Dependencies

One of the most difficult aspects of the architecture is maintaining memory dependencies.
These must be preserved between memory accesses that might potentially access the same
memory location (i.e. they may be aliased). If the compiler is able to prove that two memory
accesses cannot be to the same location then they may be arbitrarily reordered without
affecting the results generated from the program.

The code generator must observe dependencies between memory accesses that may be to the 
same location. 

The dependencies that must be observed are as follows: 

Write-After-Read (WAR): Writes must not be scheduled earlier than a preceding read that
may be aliased.

Write-After-Write (WAW): The order of potentially aliased writes must be maintained.

Read-After-Write (RAW): Reads must not be scheduled earlier than a preceding write that
may be aliased.

Write operations may only occur in the committed phase of a strand. Maintaining these rules
significantly restricts the amount of parallelism available to the architecture.

Software analysis is unable to derive much information for a large proportion of memory
accesses. These would have a major detrimental effect on the potential parallelism in the
architecture. The strand mechanism is used to allow these to be executed out-of-order in
many circumstances.

Dependencies between memory operations within a strand are always preserved. That is, the
order of such instructions is not changed. The scheduler also maintains WAR and WAW
dependencies for memory operations between strands.

The majority of performance benefit can be obtained by allowing RAW dependencies to be
reordered. This allows a read to be scheduled earlier than a potentially aliased write. Most
chains of calculations start with a read operation, and this allows the calculations to be started
earlier and be potentially executed in parallel with other code. If the read and write are not
aliased then the correct result is produced. If, however, they are aliased then the read should
have obtained the data stored by the write. The whole chain of calculations will produce an
incorrect answer.
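The scheduling rules set out above may be summarised, purely as an illustrative sketch, by a
small decision function. The name `may_reorder` and its arguments are hypothetical and are not
part of the code generator described; the second element of the result indicates that the
reordering is speculative and must be validated at run time.

```python
def may_reorder(dep, proven_not_aliased, same_strand):
    """Decide whether two dependent memory accesses may be reordered.

    dep is "WAR", "WAW" or "RAW". Returns (allowed, needs_check).
    """
    if dep not in ("WAR", "WAW", "RAW"):
        raise ValueError("unknown dependency type: %r" % (dep,))
    if proven_not_aliased:
        return (True, False)   # cannot be the same location: reorder freely
    if same_strand:
        return (False, False)  # order within a strand is always preserved
    if dep == "RAW":
        return (True, True)    # read boosted over a write, subject to a check
    return (False, False)      # WAR and WAW are maintained between strands
```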

Whenever a potentially dependent read is moved before a write a checking operation is also
present. This compares the addresses of the read and the write. If they are identical (i.e. there
is a hazard) then the strand containing the read is aborted. The earlier strand containing the
write is allowed to complete and then the whole region is repeated. The read is then able to
pick up the value written by the earlier strand and the correct result is produced. This whole
process requires the minimum of hardware overhead. No additional compensation code is
required to recover from a hazard situation. The underlying microarchitecture and execution
model handle it automatically.

Check Hazard Instructions 

Check hazard (chaz) operations are used to validate the boosting of loads over stores. They
must be present whenever a read has been boosted over a previous write. A chaz checks for an
address conflict and aborts the reading strand if one is found. All chaz instructions must be
issued in the speculative phase of the reading strand while all writes must be issued in the
writing strand's committed phase.

A particular processor may support a number of individual chaz units. A chaz unit simply
compares two addresses and generates an abort if they match. This is used to compare the
load and store addresses and abort the load strand if there is an address match.

A chaz operation has two operands, a left operand that is the store address and a right
operand that is the load address. The execution strand number is obtained from the left
operand. If the store strand is not being executed then the operation is disabled.
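A minimal software sketch of a single chaz operation follows. It assumes, for illustration
only, that each address operand is accompanied by the number of the strand that produced it
and that bit i of the enable mask is 1 while strand i is executing. The function name `chaz`
simply mirrors the operation name.

```python
def chaz(store_addr, store_strand, load_addr, load_strand, sem):
    """Return the number of the strand to abort, or None for no abort.

    The left operand is the store address (produced in store_strand) and
    the right operand is the load address (produced in load_strand).
    """
    if not (sem >> store_strand) & 1:
        return None         # store strand not being executed: chaz disabled
    if store_addr == load_addr:
        return load_strand  # address match: abort the reading strand
    return None             # no alias, so the boosted load was safe
```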

The number of the strand to be aborted is obtained from the strand associated with the right
hand operand (the load address).

Figure 12 shows an example of read speculation being performed. The original code is shown
1206. Two write operations are followed by two read operations. Any of the four addresses
may in fact be aliases to the same memory address. Software analysis before execution is not
able to determine if those locations could be aliased or not.

The scheduled code is shown 1207. The operations are split into two distinct strands 1208 and
1209. Each of the strands consists of a speculative phase 1210 and a committed phase 1211. If
read speculation is being performed then it must be done on an inter-strand basis. Speculation
between accesses within a strand is not possible as there is no way for the hardware to
perform the required selective re-execution of code. The two write operations are placed in
the temporally first strand 1208. This is because the writes are before the reads in the original
code. The reads are placed in strand 1209. However, the placement into different strands gives
the code scheduler the freedom to move the reads earlier than potentially aliased writes. The
read z 1201 is moved earlier than both writes and the read y 1204 is moved earlier than the
write x. This boosting allows improved scheduling freedom and thus greater parallelism and
performance.

If any of the locations are indeed aliased then this code sequence will produce incorrect
results. Thus chaz operations must be inserted to detect hazards and initiate the recovery from
them. A chaz is required for each write that a read is moved across to which it may be
potentially aliased. The read z has two chazs 1202 and 1205 for comparing against both the w
and x values. The read y has a single chaz 1203 for comparing against the x value.

Figure 13 below shows an example execution of the same code. The instruction stream 1301
shows the operations being performed and the memory content 1302 shows the value of
memory locations at that point of execution. The operations are tagged with the strand that
they belong to. The memory content is shaded as the value in memory is updated by a write
operation.

In this case the memory locations x and y are aliased (i.e. actually point to the same memory
location). The write w operation 1306 is performed as it is from the enabled lower strand.
When the chaz operation comparing the x and y values 1305 is executed a hazard is detected
that disables the upper strand. Since the chaz is in strand 1 all operations in strands 1 or higher
are disabled from that point of execution. Since all chazs for a strand must occur before any
writes (since chazs are in the speculative phase and writes can only be performed in the
committed phase) the strand will not have changed the memory contents. Despite the hazard
execution continues until the end of the region execution 1303. The values of locations w and
x/y are updated by the writes in strand 0.

When the end of the region is reached the previous hazard causes the region to be re-executed
1304. If no hazard was detected because there was no aliasing then all strands would have
been successfully completed during the initial execution. On the re-execution, due to the
RAW hazard, strand 0 is disabled since it was previously completed. Thus the writes 1308 and
1310 are not illegitimately repeated a second time. The x and y locations will still be aliased on
the second execution. However, the chaz comparison 1307 will not be performed since the x
value is computed in strand 0. Since that strand is not executed the chaz will detect the strand
tag from the x value and disable the chaz. Thus the chaz will not cause a hazard on the second
execution. The read y operation is performed and reads the value written by the write x
operation on the first execution of the region. Thus even though the write x operation occurs
after the read y operation in the region schedule, the re-execution allows the later write to pass
data to the earlier read if necessary.
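The two executions described above can be reproduced with a small software model. This is a
sketch only. The exact order of operations in the schedule, the function name
`run_figure13_region` and the dictionary representation of memory are assumptions made for the
example; only the placement of writes in strand 0, reads and chazs in strand 1, and the
boosting of the reads over the writes are taken from the description.

```python
def run_figure13_region(memory, addr, data):
    """Run the two-strand region until no abort occurs.

    Strand 0 writes w and x; strand 1 reads z and y ahead of those writes.
    Returns (values read by strand 1, number of region executions).
    """
    schedule = [                      # assumed schedule order
        ("read", 1, "z"),             # read z boosted over both writes
        ("write", 0, "w"),
        ("read", 1, "y"),             # read y boosted over write x
        ("chaz", 1, ("w", "z")),
        ("chaz", 1, ("x", "z")),
        ("chaz", 1, ("x", "y")),
        ("write", 0, "x"),
    ]
    completed = set()
    executions = 0
    while True:
        executions += 1
        enabled = {0, 1} - completed  # completed strands are not re-run
        store_strand_running = 0 in enabled
        reads, aborted = {}, False
        for op, strand, arg in schedule:
            if strand not in enabled:
                continue              # operation belongs to a disabled strand
            if op == "read":
                reads[arg] = memory[addr[arg]]
            elif op == "write":
                memory[addr[arg]] = data[arg]
            else:
                store, load = arg
                # The chaz is disabled when the store strand is not executing.
                if store_strand_running and addr[store] == addr[load]:
                    aborted = True
                    enabled = enabled - {1}  # abort the reading strand
        completed |= enabled
        if not aborted:
            return reads, executions
```

With x and y mapped to the same address the region runs twice and the second read of y returns
the value stored by the write of x. With four distinct addresses a single execution suffices.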

Strand Abort Mechanism

Figure 14 illustrates the strand abort mechanism that is used to allow recovery from hazards
detected by check hazard operations. The figure shows a region and the execution during the
abort and then during the subsequent restart required to recover from the abort. If such a
hazard occurs then a strand is performing a speculation that is invalid. The strand must be
aborted before it enters its committed phase and must be subsequently re-executed with the
cause of the hazard removed.

The initial execution of the region is illustrated in 1401. It is composed of a number of strands
1407. A particular chaz operation 1404 compares two addresses and generates an abort if they
match. This is illustrated on the diagram with certain operations becoming aborted 1405. All
of the aborts must occur before the commit point of the strands 1408. The SEM state is
shown on a cycle by cycle basis 1403. The abort has an immediate impact on the SEM state by
disabling strands 2 and above. The presence of aborts causes the region to be restarted once
the current execution of the region has completed.

The area 1402 shows the restart execution of the region. The strands 0 and 1 will have
completed their execution and entered their commit phase as part of the initial region
execution. Thus when the region is restarted those strands should not be executed again. They
are marked as squashed 1406. However, the strands which were aborted can be re-run. This
is illustrated by the initial SEM state that is shown for the restart. The region is executed again
but with the first two strands disabled. If the region completes without any further aborts then
the next region can be executed, otherwise further restarts may occur.

The initially generated abort will have been caused by a hazard between the aborted strand
and a write on an earlier strand. Since that earlier strand will not be executed as part of the
restart the same abort will not occur again. Moreover the read on the restart will obtain the
data that was written by the lower numbered strand, even though that write is from an
operation that is later in the schedule.

Thus this mechanism provides a means of recovering from hazards without any requirement
for alternative code for specially handling hazards. The strand and abort detection mechanism
is used to re-execute the region the required number of times so that there are no hazards and
the correct results are ultimately obtained.

Strand Guard Mechanism 

The guard mechanism also uses the concept of a strand abort, but for the purpose of
improving code density rather than recovering from bad speculations. Figure 15 illustrates the
use of the strand guard mechanism. Area 1501 shows the first execution of a region. The
region contains a number of strands 1507. Strands 3 and 4 have been scheduled without
reference to dependencies on the earlier strands. This allows more unconstrained motion of
operations from those strands and allows them to be issued in earlier slots in the schedule,
reducing the overall schedule length and improving the utilisation of the execution word. Thus
the code density is improved. However, strands 3 and 4 cannot be executed in the same
region execution as the lower strands as data generated by them may not be read later in the
schedule by strands 4 and 5. A guard operation 1506 associated with strand 3 ensures that the
higher strands are aborted if the earlier strands are being executed. The SEM state is shown on
a cycle by cycle basis as 1505. The lower strands are being executed so strands 3 and 4 are
aborted 1503 as shown and influence the SEM so that their operations do not update the
machine state.

Since an abort occurs the region is re-executed, as shown as 1502. In this case the strands that
were completed (strands 0-2) are now disabled as they should not be executed again. Since
strand 3 is the first strand being executed the guard unit no longer generates an abort and
allows their execution to continue normally.

Strand Management Hardware

This section describes the detail of the hardware mechanisms used to implement strands.

Functional Unit Data Plane

The internal architecture of a functional unit is shown in Figure 16. This diagram concentrates
on the data plane of the functional unit. That is, it shows the buses used to pass application
data from one unit to another. A separate control plane is used to influence how that data
flows and the operations that are performed upon it.

The central core of a functional unit is the execution unit 1602. It performs the particular
operation for the unit. New functional units may be created using user defined execution
units. The blocks 1605 and 1606 are instantiated in order to form a functional unit 1601.
These blocks allow the functional unit to connect to other units and allow the unit to be
controlled from the execution word.

Each operand input 1606 to the execution unit may be chosen from one of a number of
potential data buses using a multiplexer. In some circumstances the operand may be fixed to a
certain bus, removing the requirement for a multiplexer. The multiplexer selections are
derived from the execution word. The number of selectable sources and the choice of
particular source buses are under the control of the architectural optimisation tools. The data
is presented to the execution unit on the operand input buses 1603.

All results from an execution unit 1604 are held in independent output registers 1605. These
drive data on buses connected to other functional units. The output register holds the same
data until a new operation is performed on the functional unit that explicitly overwrites the
register.

Functional Unit Control Plane 

Figure 17 shows greater detail of the control plane for a functional unit. The area 1703 shows
the bits within the execution word that are used to control the functional unit. This is
composed of a number of different sections for describing different aspects of the operation
to be performed by the functional unit. This field is formed from a subset of bits from an
overall execution word used to control all of the functional units within the architecture.

The sub-field 1704 controls which of the output registers should be updated with a result. The
sub-field 1705 is used to control the operand multiplexers 1711 to select the correct source of
data for an operation. This data is fed to the execution unit via the operand inputs 1709. The
sub-field 1706 provides the method that should be performed by the execution unit and is fed
to the unit via the input 1710.

The optional sub-field 1707 provides the number of the strand to which the operation
belongs. This is used to select the corresponding status bit from the Strand Enable Mask
(SEM) 1712 via the multiplexer. This is a global state accessible to all functional units that
dynamically indicates which strands are currently active. This is used to condition the
functional unit select so that if a particular strand is disabled (its SEM bit is 0) then the
functional unit operation is not performed. This is the means by which the strand status is
used to influence the conditionality of particular operations. Certain functional units may be
executed unconditionally, independently of the strand status. Such units do not require such
a strand field.

The opcode bits 1708 are used to select the particular functional unit. If the opcode does not
have the required value then all of the other bits are considered to be undefined and the
functional unit performs no operation.

Squash Unit 

This section describes the squash unit. This is a special type of functional unit that allows the
strand state to be modified by the values in the data plane of the architecture. Figure 18 shows
examples of squash units in the architecture 1801. The example shows two different squash
units 1803 in the functional unit array. Any number of squash units may be supported in order
to support the level of operation parallelism that is required in the architecture. Each squash
unit has a status input 1806 and a method input 1807. The status is generated as a result of
some previous data operation, such as a comparison, that is to be used to influence the
conditionality of subsequent code. The method shows the type of condition that is being
evaluated from the status. The output 1805 of the squash unit is to the global Strand Control
Unit 1804.

The Strand Control Unit 1804 is responsible for combining the squashes originating from a
number of squash units in the architecture. It then updates the strand status accordingly in the
form of the Strand Enable Mask (SEM) which is distributed 1802 to various functional units.
This is a status bus with a single bit for each strand. This indicates if the strand is currently
enabled or not. The SEM is distributed to many functional units within the architecture and is
used to affect their operation as described previously. Thus the action of a squash unit can
directly impact the execution of various subsequent operations.

Check Hazard Unit 

The Check Hazard Unit is a special unit that, like squash, is used to impact the state of the
SEM and thus the execution of various subsequent operations. However, unlike the squash
unit, this unit is used to detect transient violations of the required ordering dependencies
between operations. Figure 19 shows a number of Check Hazard Units 1902 present in an
example architecture 1901.

The Check Hazard Unit generates aborts when it is supplied with two addresses that match.
These addresses represent memory locations that are being accessed by certain memory
operations. If these memory operations are accessing the same locations then a hazard has
occurred that influences the execution of subsequent strands. The abort information 1905
from each of the Check Hazard Units is sent to the Strand Control Unit 1903.

The Strand Control Unit 1903 collates all of the incoming abort information and updates the
Strand Enable Mask (SEM) accordingly. The aborted strands are marked as disabled in the
SEM which is distributed 1904 to the functional unit array. Unlike a squash, however, the
abort only occurs transiently until the end of the region execution. After the region has been
completed the region will be re-executed and the previously aborted strands will be enabled
again.

Guard Unit

The guard unit is a special unit, like the squash unit, that is used to impact the state of which
strands are enabled. Guard units are used to disable strands that are being executed out of
order with respect to other strands. Such out-of-order execution is used to pack a greater
number of operations within a region and thus achieve a greater code density. Use of guards
allows trade-offs to be made between code performance and code density.

Figure 20 shows two different guard units 2002 present in the functional unit array 2001. Each
guard unit examines the strand number in which it is being issued 2008 and the status of all
the strands 2007. The unit determines if that is the lowest strand that is being executed using
the comparator 2006. If so and the enable 2009 is set then it is valid for the strands that are
being guarded to be executed as there are no lower active strands that could influence its
execution. If the guard strand is not the lowest being executed then the strand (and all higher
strands) must be aborted. This prevents the guarded strand being executed out of order.
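The comparison performed by a guard unit can be sketched in software as follows. The sketch
rests on two assumptions not confirmed by the description: that bit i of the strand status is 1
while strand i is being executed, and that a cleared enable input simply means no guard
operation is being issued on that cycle. The function name `guard` is hypothetical.

```python
def guard(issue_strand, sem, enable=True):
    """Return the lowest strand to abort, or None if no abort is needed.

    The guarded strand may proceed only when it is the lowest strand
    currently being executed.
    """
    if not enable:
        return None  # assumed: no guard operation issued on this cycle
    lower_active = sem & ((1 << issue_strand) - 1)  # enabled strands below
    if lower_active == 0:
        return None  # guarded strand is the lowest executing: valid to run
    return issue_strand  # abort this strand; higher strands follow
```

With all five strands enabled, `guard(3, 0b11111)` requests an abort of strand 3. Once strands
0 to 2 are disabled on the restart, `guard(3, 0b11000)` requests none.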

The strand control unit 2003 has a number of incoming abort buses 2004 from check hazard
and guard units in the system. When an abort is received the aborting strand and all higher
strands are aborted. This influences the state of the Strand Enable Mask (SEM) generated
from the strand control unit. This is distributed 2005 to functional units in the array. In turn
this prevents any operations from those or higher strands being executed.

Strand Control Unit 

This controls the execution of strands within a particular region. The unit generates a vector
of the current strands that are being executed. Functional units use the vector in order to
squash operations from disabled strands. The unit also detects conditions that require the
current region to be re-executed to preserve memory dependencies.

Strand Abort Mask (SAM) 

The SAM is a bit mask with a single bit for each strand. If a bit is reset then that indicates that
the corresponding strand has been aborted. A strand may be aborted due to a check hazard or
guard instruction. A strand abort indicates that there has been a dependency violation and
the strand must be re-executed (unless the strand is squashed).

A number of individual abort buses are supported to allow the use of multiple units that may
generate strand aborts. Such units include check hazard and guard units, as previously
described. The logic for updating the SAM in response to these aborts is shown in Figure 21.

The box 2101 represents the overall abort handling section of the strand control unit. Each
abort unit produces a strand number to be aborted 2107. For each abort strand 2107 there is
also an enable bit 2108. The appropriate bit from the Strand Predicate Mask (SPM) 2106 is
examined using the multiplexer 2104. If the strand is already squashed then no action is taken.
If the strand is not squashed then that strand and all higher numbered strands are aborted.
Higher numbered strands have to be aborted since, if a strand produces an incorrect value due
to a dependency violation, then higher strands may be polluted with the incorrect value. The
units 2105 generate a mask of set bits from the supplied strand number and above. The valid
flag from the abort unit gates the strand mask. All the aborts are combined and are used to
mask off bits in the SAM.
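The abort handling logic described above can be expressed as a short software sketch. The
16-strand width and the names `mask_from` and `update_sam` are assumptions made for the
example. As in the description, a set SAM bit means the strand has not been aborted and a set
SPM bit means the strand has not been squashed.

```python
NUM_STRANDS = 16                      # assumed width for this sketch
ALL_STRANDS = (1 << NUM_STRANDS) - 1


def mask_from(strand):
    """Mask with bits set for `strand` and every higher numbered strand."""
    return ALL_STRANDS & ~((1 << strand) - 1)


def update_sam(sam, spm, aborts):
    """Apply one cycle of aborts to the Strand Abort Mask.

    `aborts` holds one (strand_number, enable_bit) pair per abort unit.
    """
    combined = 0
    for strand, enable in aborts:
        squashed = not (spm >> strand) & 1
        if enable and not squashed:   # the valid flag gates the strand mask
            combined |= mask_from(strand)
    return sam & ~combined & ALL_STRANDS  # mask off the aborted strands
```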

Note that if a strand is aborted but later squashed then higher numbered strands will have
already been aborted. This may cause a region restart that is unnecessary. Thus it is generally
better to resolve the squash status for a strand before issuing any aborts for it.

Upon entry to a region the SAM is initially cleared to the state of the strands being aborted on
that cycle. Since aborts will not generally occur on the last cycle this resets the SAM 2102 to all
set bits. The end of region flag 2103 is supplied to a multiplexer to implement this behaviour.

At the end of region execution the SAM is combined with the SPM to determine if any
un-squashed strands were aborted. If so then the region is restarted.

Strand Predicate Mask (SPM) 

The SPM is a bit mask with a single bit for each strand. If a bit is reset then that indicates that
the corresponding strand has been squashed and it should not be committed. Bits within the
SPM are cleared by squash operations. A single squash operation may clear a block of bits
within the SPM. Once a bit is reset it may not be set once again for the duration of the region
execution.

A squash is able to clear multiple bits in the SPM in order to efficiently support nested
conditional constructs. The top level strand must squash all the strands within the conditional
construct if the top level conditional is false. This is because squash operations issued in
squashed strands are disabled. Squash operations themselves may only be issued in the
provisional or committed phase of their home strand, as they cannot be executed speculatively
since they have an effect on the SPM.

The branch control unit may also generate squashes. Such squashes are the result of branch
resolution. If a branch, from a particular strand, is taken then all higher numbered strands are
squashed. This prevents their execution as they are not logically reached since an earlier
branch is taken.

The bit within the SPM associated with a particular strand should be reset before that strand
enters its committed phase. It is illegal to squash a strand when it has already entered its
committed phase.

The SPM is updated on each cycle on the basis of squash vectors generated by the squash
units in the processor. A number of separate squash units may be supported in order to
provide greater parallelism in the processor. Each squash unit provides a vector of strands to
squash. These are combined and are then used to clear bits in the SPM. The SPM is read on
each cycle in order to form the SEM. The SEM shows which strands are currently enabled
and is distributed to all functional units. It is used to disable operations issued for disabled
strands.
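
The per-cycle SPM update can be sketched in software as follows. This is only an illustrative
model, not the hardware logic. It assumes that bit i of each mask corresponds to strand i, that
a set bit in a squash vector requests a squash of that strand, and that the mask is 16 bits wide
(the strand limit quoted in the description of conditional handling). All function names are
hypothetical.

```python
NUM_STRANDS = 16                      # assumed mask width (strand limit from the text)
FULL_MASK = (1 << NUM_STRANDS) - 1


def combine_squash_vectors(vectors):
    """OR together the squash vectors produced by all squash units in one cycle."""
    combined = 0
    for vector in vectors:
        combined |= vector
    return combined & FULL_MASK


def update_spm(spm, squash_vectors):
    """Clear the SPM bit of every strand named in any squash vector.

    Bits are only ever cleared here, so a squashed strand stays squashed
    for the remainder of the region execution.
    """
    return spm & ~combine_squash_vectors(squash_vectors) & FULL_MASK


def branch_squash_vector(taken_strand):
    """Squash vector for a taken branch: all strands numbered above taken_strand."""
    return FULL_MASK & ~((1 << (taken_strand + 1)) - 1)
```

For example, a taken branch in strand 2 yields a vector that, when applied to a fully set SPM,
leaves only strands 0 to 2 un-squashed.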

The logic for updating the SPM is shown in Figure 22. The box 2201 represents the squash
handling section of the overall strand control unit. A number of squash input vectors 2205 are
supplied from the squash units in the functional unit array (not shown). Additional squashes
2206 may be generated by a debug control unit (not shown). Also, additional squashes 2207
may be generated by the branch control unit. The squashes are combined to update the SPM
2202. Upon the initial execution of a region the SPM 2202 is initialised with a value 2203 that
is representative of the first strand from the region that should be executed. All strands below
the entry strand have their SPM bit cleared to squash them. All higher bits are set so that the
strands are enabled. Normally the entry strand number is 0 so that all bits in the SPM are set. A
non-zero entry strand is used when entering a region through a side entry. Side entries are
used to support returns from calls within a region. This mechanism allows the return to be
made to the start of the calling region. The calling strand and all earlier strands are disabled so
that they are not re-executed.
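
The initial SPM value for a given entry strand can be modelled as below. This is a sketch under
the assumptions that bit i represents strand i, that the entry strand's own bit is set (it is the
first strand executed), and that the mask is 16 bits wide. The name initial_spm is hypothetical.

```python
NUM_STRANDS = 16                      # assumed mask width
FULL_MASK = (1 << NUM_STRANDS) - 1


def initial_spm(entry_strand=0):
    """SPM loaded at region entry: strands below the entry strand start squashed."""
    if not 0 <= entry_strand < NUM_STRANDS:
        raise ValueError("entry strand out of range")
    return FULL_MASK & ~((1 << entry_strand) - 1)
```

With the normal entry strand of 0 every bit is set. A side entry at strand 3 clears bits 0 to 2,
so the calling strand and all earlier strands are not re-executed.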

The end of region flag 2208 and SAM 2204 are combined so that at the end of execution of
the region the SPM is updated appropriately. Thus if no strands were aborted then the SPM
is cleared. However, any strands that were aborted maintain an unchanged SPM state. Thus if
there are any un-squashed but aborted strands then they will have the corresponding bit set in
the SPM. If the SPM is non-zero then the region is restarted in order to execute those strands.

Strand Enable Mask (SEM) Calculation

The Strand Enable Mask (SEM) is calculated on each clock cycle and distributed to all
functional units. It is used to mask all operations issued to strands that are disabled. This
prevents disabled strands generating permanent side effects from their committed phase.

The SEM is calculated from a number of individual components:

SPM: Any strand that has been squashed is disabled. Once a strand has been squashed then
no further operations are executed from it. There is a delay of a few clock cycles between
issuing a squash operation and the effect being represented in the SEM value distributed to all
functional units. The squash must therefore be calculated some clock cycles ahead of the
affected strand entering its committed phase.

SAM: Any strand that is marked as being aborted is disabled. The SAM is temporary and is
reset if the region is re-executed.

The calculation of the SEM is detailed in Figure 23. The SAM 2301 and SPM 2302 are
combined on a cycle-by-cycle basis to produce the SEM 2303.
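
The combination of the two masks, together with the end of region update of the SPM described
in the preceding section, might be modelled as follows. The polarity of the SAM is an
assumption: since the SAM is reset "to all set bits", this sketch takes a set SAM bit to mean
"not aborted" and a cleared bit to mean "aborted". The function names and the 16-bit width are
also assumptions.

```python
NUM_STRANDS = 16                      # assumed mask width
FULL_MASK = (1 << NUM_STRANDS) - 1


def strand_enable_mask(spm, sam):
    """A strand is enabled only if it is neither squashed (SPM) nor aborted (SAM)."""
    return spm & sam & FULL_MASK


def end_of_region(spm, sam):
    """Return (next_spm, restart) for the end of a region execution.

    Strands that were not aborted (SAM bit set, by assumption) have their SPM
    bit cleared. Aborted strands keep their SPM state, so any bit left set
    marks an un-squashed but aborted strand and forces a restart.
    """
    next_spm = spm & ~sam & FULL_MASK
    return next_spm, next_spm != 0
```

If strand 1 is aborted while strands 0, 1 and 3 are un-squashed, only bit 1 survives in the SPM
and the region is restarted to execute that strand alone.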

Division into Regions

A region is formed from a contiguous sequence of original sequential instructions. The
number of instructions that are included within a region is dependent upon many factors,
especially the control flow topology of the included instructions. A region is not terminated by
branches or calls and thus encompasses a much greater extent than a basic block.

The structure of an example region is shown in Figure 8. A region 801 has one or more entry
points and one or more exits. It is composed of a number of strands 802. Execution always
continues until the last execution word in the region. The set of region exits 805 are the
strands that initiate the branch from the region. The main entry point 803 for a region is
always to strand 0. When a branch is made to a region a base strand number is specified. This
is the first strand that is executed and lower strands are automatically squashed.

In addition to the main entry point 803 a region may have a number of side entries. A side
entry is simply an execution of the region that does not start with strand 0. Side entries are
created when a previous strand performs a function call and the code after the call is part of
the same region. A new strand is created at the point of the call and the return address will
be a side entry to that strand. An example side entry is shown as 804. Side entries are also created
when the region construction process encounters complex control flows where there are
control inflow edges to a strand that are outside of the region. In general the number of side
entries within a region is minimized, especially for regions within performance critical
functions.

Division into Strands

This section describes the detail of how code within a particular region is subdivided into one
or more strands at the time of code generation. Various heuristic techniques are used to make
decisions about the partitioning in order to maximize the potential for parallelism between
individual strands within a region.

Each region is subdivided into one or more individual strands. A strand represents an ascending
sequence of contiguous instructions from the original code. Each successive strand is allocated
a higher strand number. Thus in terms of relationship to the original code, the strand order is
always the same as the original instruction order.

Each operation in a region belongs to a particular strand. This region may itself consist of
multiple basic blocks and therefore multiple branches. Thus some of the code within the
region is conditional since it would not be executed if certain branches take particular courses.
The scheduler is able to perform "global scheduling" where operations are moved between
basic blocks in order to lower execution time and make good use of functional unit resource
availability. The conditional evaluation operations can be scheduled like other operations. This
allows a great deal of flexibility in the arrangement of code.

The order of operations within a given strand is fixed by the dependencies between operations
within that strand. If two operations may depend on each other (and thus their order can
affect program results) then the same order must be maintained in the final schedule. The
dependency rules between strands can be more flexible, however. In some circumstances
operations that are potentially dependent can have their order transposed.

There is a loose correspondence between strands and basic blocks in the original program.
However a single basic block can be transformed into multiple strands in some circumstances.
This happens if a basic block contains certain types of instructions or is beyond a certain
length. Moreover, conditional branches that are normally represented as a single instruction
are broken down into two operations in the preferred embodiment. These are the condition
evaluation and the branch itself. The branch is considered to be located in a separate basic
block from the condition evaluation.

A strand may be terminated by a store instruction. This is done to improve the potential
parallelism in the region. The preferred embodiment supports speculative execution of load
instructions by boosting them earlier than potentially aliased store instructions. However, this
boosting can only be performed if the store and load are in different strands. Splitting strands
at the point of a store allows this type of optimisation to be performed when the load
operation is in the same basic block. A store may cause a strand split if there are subsequent
loads in the same basic block that are potentially (but not definitely) aliased to the store.
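
The store-based splitting rule can be illustrated with a small sketch. It is not the code
generator of the preferred embodiment. The Op record, the three-valued alias classification
("no", "maybe", "must") and the toy_classify oracle are hypothetical stand-ins for whatever alias
analysis a real implementation would use.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Op:
    kind: str        # "load", "store" or "other"
    addr: str = ""   # symbolic address, used only by the toy alias oracle


def split_block_at_stores(ops, classify):
    """Split one basic block into strands.

    A store terminates the current strand when some later load in the same
    block is potentially, but not definitely, aliased with it.
    """
    strands = [[]]
    for index, op in enumerate(ops):
        strands[-1].append(op)
        if op.kind != "store":
            continue
        later_loads = (o for o in ops[index + 1:] if o.kind == "load")
        if any(classify(op, load) == "maybe" for load in later_loads):
            strands.append([])        # following operations start a new strand
    if not strands[-1]:
        strands.pop()                 # only reached for an empty block
    return strands


def toy_classify(store, load):
    """Hypothetical oracle: names starting with 'p' are unresolved pointers."""
    if store.addr == load.addr:
        return "must"
    if store.addr.startswith("p") or load.addr.startswith("p"):
        return "maybe"
    return "no"
```

A block containing a pointer store followed by a load through a different pointer is split
after the store, so the load lands in a later strand and becomes eligible for boosting.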

Conditional Handling 

Figure 24 illustrates how the strand creation process handles conditional code. Some example
code is shown as 2405. A control flow graph 2406 for that code (extracted from the branch
structure) is shown with basic blocks A to E. The delineation of the basic blocks is shown as
2407. Block A is always entered at the start of the region. Blocks B and C are executed on a
mutually exclusive basis and are formed from an IF-THEN-ELSE style construct. Block C
also has a conditional block D formed from an IF-THEN style construct. Finally all control
flow routes merge in basic block E.

Each basic block is allocated a new strand 2411 as illustrated in Figure 24. Each of the strands
is numbered 2410. A strand is composed of a number of individual operations, depending
upon the instructions present in the original basic blocks. Basic block A terminates in a
conditional branch that jumps over block B into block C. Strand 0 holds operations for block
A and two squash operations 2412 and 2413 are generated for the strand. The first squash
2413 controls the entry to strand B (the fall through case of the conditional branch). The
second squash 2412 controls the execution of a specially generated strand 1. This strand holds
the branch instruction to strand 3 which holds the code for block C.

The key provides information about the meaning of each of the operation types. An ordinary
operation is shown as 2402. A squash operation is shown as 2403. A branch operation is
shown as 2404. An operation that becomes deleted as part of the processing is shown as 2401.

As the code is first generated it is not known whether a conditional branch will be to code
inside of the region or not. The extent of the region is determined as a dynamic process so it is
not possible to determine beforehand if an actual branch will be required or not. The region
creation process assumes the worst case and generates branch operations. These are then
deleted as soon as it is determined that the destination of the branch is within the region. For
instance strand 1, which holds a branch 2408 to block C of code in strand 4, is deleted as soon
as block C is encountered in the region. If the region was terminated before block C was
reached then the branch would remain.

Strand 2, which contains code for block B, includes a branch operation 2409. This is
generated from the unconditional branch in the original code at the end of block B. Since the
branch is unconditional it can remain in the same strand as the rest of the operations for the
block and it does not require a separate squash operation. When block E is reached (the
destination of the branch) then the branch 2409 can be deleted.

When a branch is deleted and it is the only operation within a strand then the whole strand
can be reclaimed. This is important since there is a limit of 16 strands that may be supported
and if that limit is reached then the region must be terminated. Reclaiming strands prevents
regions being unduly truncated by this limit.
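
Branch deletion and strand reclamation can be sketched as follows. The representation of a
strand as a list of (kind, payload) tuples and both function names are assumptions made for
illustration. Renumbering of squash targets after a strand is reclaimed is left to the squash
resolution stage and is not modelled here.

```python
MAX_STRANDS = 16   # strand limit stated in the description


def delete_resolved_branches(strands, blocks_in_region):
    """Drop branches whose destination block is now inside the region.

    A strand that held nothing but a deleted branch becomes empty and is
    reclaimed, which frees a strand number for later code.
    """
    result = []
    for ops in strands:
        kept = [op for op in ops
                if not (op[0] == "branch" and op[1] in blocks_in_region)]
        if kept:
            result.append(kept)
    return result


def region_is_full(strands):
    """True when no further strand can be allocated to the region."""
    return len(strands) >= MAX_STRANDS
```

Applied to a layout like that of Figure 24, the branch-only strand disappears once block C joins
the region, while the strand for block B merely loses its trailing branch once block E is
reached.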

Thus this process converts the control flow present in the original code into a sequence of
strands with appropriate squash operations. Conditional blocks of code are converted into
conditional strands, allowing much greater scheduling freedom. Only branches to destinations
outside of the region remain as branches. This mechanism can convert arbitrary control flow
into strand structures and can support the region being terminated at any point during the
translation process.

Side Entry Squashes

If a region has side entries then special unconditional squash operations have to be inserted
into the side entry strand to ensure that only reachable strands are executed. This is illustrated
in Figure 9.

The original code 905 forms an IF-THEN-ELSE construct. The control flow graph is shown
as 906. Depending on a condition evaluated in block A, either block B or block C will be
executed. The code is formed into a region of strands 912 each with an allocated number 913.

The key shows the types of operation. An ordinary operation is 902. A squash operation is
903. A branch operation is 904. An operation that becomes deleted as part of the processing is
901.

Two squash operations in block A disable the appropriate strand depending upon which path
is taken. However, block B strand 907 contains a call operation 910. If the path through B is
taken then the whole region is executed and a branch is made to the called function. Block D
is automatically squashed by the hardware as a branch from block B is taken so all higher
numbered strands are squashed.

Block B is divided into two separate strands, 2 and 3. Strand 2 contains the translated code
from before the call including the call itself 910. Strand 3 contains the code in block B after
the return from the call. Strand 3 is a side entry 909 since its address is used in an address link
for return from the function. The return branch specifies strand 3 as the first strand to be
executed in the region so strands 0 to 2 are automatically disabled and code from them is not
repeated. However, strand 4 containing code from block C is enabled but this should not be
executed as the path through B has been taken. A special unconditional squash operation 911
is inserted in strand 3 to disable block C. In general, for each side entry squashes are issued for
all higher numbered strands that do not post-dominate the side entry.
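
The general side entry rule can be sketched with a standard iterative post-dominator
computation over the control flow between strands. The graph used in the usage note is an
assumption loosely based on the Figure 9 description (strand 5 for block D is not stated in the
text), and the function names are hypothetical.

```python
def post_dominators(successors, exit_node):
    """Map each node to the set of nodes that post-dominate it (itself included).

    successors maps every node to a list of its control flow successors.
    Assumes every node can reach exit_node.
    """
    nodes = set(successors)
    pdom = {node: set(nodes) for node in nodes}
    pdom[exit_node] = {exit_node}
    changed = True
    while changed:
        changed = False
        for node in nodes - {exit_node}:
            common = set(nodes)
            for succ in successors[node]:
                common &= pdom[succ]
            updated = {node} | common
            if updated != pdom[node]:
                pdom[node] = updated
                changed = True
    return pdom


def side_entry_squashes(entry, successors, exit_node):
    """Higher numbered strands that do not post-dominate the side entry strand."""
    pdom = post_dominators(successors, exit_node)
    return sorted(s for s in successors if s > entry and s not in pdom[entry])
```

With the assumed strand graph {0: [2, 4], 2: [3], 3: [5], 4: [5], 5: []} and strand 3 as the
side entry, the only strand returned is 4, which corresponds to the unconditional squash of
block C described above.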

Squash Resolution 

When the construction of the region is completed the squashes within the region must be
finally resolved. At this stage the strands are given their final numbering so that appropriate
immediate strand values can be specified for the squash operations. During the construction
process itself, strands may be reclaimed as they become empty if an unnecessary branch
operation is deleted. Some squashes may be deleted at this stage if they are shown to be
redundant.

Figure 25 shows the set of operations and strand relationships for the example code used to
illustrate conditional handling. The example control flow graph is shown as 2507. This is
mapped into a region containing a number of strands 2509, each allocated with a number
2508. The key shows the types of operations. An ordinary operation is 2502. A squash
operation is 2503. An operation that is deleted by the processing is 2501.

If a branch is deleted then the squash associated with that branch is modified to control the
destination strand directly. For instance, the branch to block C has been deleted because block
C is within the region. The squash within block A previously used to control that branch is
made to control block C directly 2504. The branch and controlling squash 2506 at the end of
block C can be deleted as block E post-dominates all blocks in the control flow graph.

Some squashes may be used to control multiple strands. For instance, the squash for block C
also controls block D with the same condition 2505. This is because block D is dominated by
block C. In other words, the code must pass through block C in order to reach block D.
Block D is also controlled by a squash operation in block C but if block C is itself squashed by
block A then that squash operation will never be executed. Each squash operation specifies a
mask of strands to allow multiple squashes to be initiated by a single operation. If the
necessary strands cannot be covered by a single squash operation then the squash resolution
stage may insert additional squashes.

In general, a squash operation performed in strand x to control strand y is rewritten to also
apply the same condition to all strands z, where z is dominated by x but does not
post-dominate x. Thus in the example block E is not included in the squashes of block A because
all control flows pass through block E so its execution is unconditional. During the region
construction process a matrix of strand dependencies is created, allowing domination and
post-domination relationships to be determined.

All squashes to control strands that post-dominate the strand in which the squash is
performed are deleted. This is the case with the second squash in block D to control block E.
Whether the path through C is taken or not, block E is executed. Note that in the case of the
squashes in block A, block C does not post-dominate block A (due to the unconditional
branch at the end of block B) so the squash for it is retained. This rule allows appropriate code
to be generated for differing control topologies of IF-THEN and IF-THEN-ELSE
constructs.

Representation of Speculation Opportunities

Weak dependence arcs are used to provide hints to the scheduling algorithms about the ideal
ordering of operations. However, unlike normal arcs the dependency rules can be broken and
the operations issued out of order. This provides greater flexibility to the scheduler if it is
attempting to improve code density by allowing greater scheduling freedom.

In the preferred embodiment the code is formed into a graph representation. This represents
the individual operations as nodes in the graph and dependencies between operations as arcs
within the graph.

A weak dependence arc has an associated conditional arc. This arc is only activated if the
ordering given by the weak dependence arc is broken. Conditional arcs are used to make the
issue of certain operations, to compensate for the weak arc violation, conditional.

Figure 4 illustrates the structure of weak and conditional arcs. A dependee operation 401 has a
dependent operation 402. The two operations are connected by a weak arc 405. As shown the
weak control arc has an associated 406 conditional arc 404. This conditional arc is dependent
upon a conditional operation that is only performed if the weak control arc dependency order
is violated. It connects to the conditional operation 403. A single conditional operation may be
the dependee of a number of conditional arcs 407. If any of these arcs are activated then the
conditional operation is issued.
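
The relationship between weak arcs, conditional arcs and conditional operations can be sketched
as a small data structure. The WeakArc record and the function name are hypothetical. The
sketch treats an arc as violated only when the dependent operation is scheduled strictly
earlier than the dependee, which is an assumption, since the handling of same-cycle issue is
not stated here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WeakArc:
    dependee: str         # operation that logically comes first
    dependent: str        # operation that logically comes later
    conditional_op: str   # compensation operation reached via the conditional arc


def conditional_ops_to_issue(schedule, weak_arcs):
    """Return the conditional operations that must be issued for a schedule.

    schedule maps an operation name to its issue cycle. A conditional arc is
    activated when its weak arc ordering is broken, and a conditional
    operation is issued if any arc leading to it is activated.
    """
    needed = set()
    for arc in weak_arcs:
        if schedule[arc.dependent] < schedule[arc.dependee]:
            needed.add(arc.conditional_op)
    return needed
```

Because the result is a set, several violated weak arcs that share one conditional operation
cause that operation to be issued only once.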

Memory Dependence Analysis 

Memory dependence analysis must be performed between memory access instructions in the
code. This involves checking to determine if the access could be aliased with an earlier
memory access. The accesses are aliased if they may potentially access the same memory
location. If so, and one operation is a load and the other a store, then their ordering cannot be
changed in the schedule as that could impact the results generated by the program.
Dependence arcs are created between potentially aliased operations to ensure the correct
ordering is maintained in the final schedule.

Alias Checking

Alias checks are made with all previous stores in the code that are reachable from the strand
of the new operation. If the previous store is from a strand that is on a mutually exclusive
control flow path in the region then no dependence arc needs to be generated. For instance, a
region may include an IF-THEN-ELSE control flow structure. Memory accesses in the ELSE
part of the construct will not have dependency arcs generated with operations from the
THEN part as the strand squashing will ensure that both paths are not executed. Thus
memory access operations from the ELSE part can be issued before potentially aliased
memory access operations from the THEN part.

If a previous store is aliased then the type of dependency arc generated will depend upon a
number of factors. If the previous store is within the same strand then an ordinary control arc
is generated between the two nodes to ensure they maintain the same order. If the nodes are
in different strands then a weak control arc may be generated. An associated conditional arc
will also be generated. This mechanism allows memory accesses to be performed out of order
if appropriate recovery measures are put in place.
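
The choice of arc type can be summarised as a small decision function. This is a simplified
sketch: the description says a weak arc "may" be generated for accesses in different strands and
mentions a number of factors, whereas this illustration always chooses the weak arc in that
case. The enumeration and parameter names are hypothetical.

```python
from enum import Enum


class Arc(Enum):
    NONE = "none"                                # no ordering constraint needed
    CONTROL = "control"                          # ordinary arc, order must be kept
    WEAK_WITH_CONDITIONAL = "weak+conditional"   # order may be broken with recovery


def choose_arc(store_strand, access_strand, mutually_exclusive, may_alias):
    """Pick the dependence arc between an earlier store and a later access."""
    if not may_alias or mutually_exclusive:
        return Arc.NONE          # squashing guarantees both paths never execute
    if store_strand == access_strand:
        return Arc.CONTROL
    return Arc.WEAK_WITH_CONDITIONAL
```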

Potentially Aliased Dependence 

In this case there is a potential dependency between an earlier store and a later load operation.
In general it is unlikely that the two accesses will actually be aliased. In most cases issuing the
load earlier than the store will still produce the correct results. However, the architecture must
detect the cases in which they are aliased and provide an appropriate recovery mechanism.

Figure 11 shows an example of potentially aliased access handling. A weak control arc 1113 is
generated between the store 1102 and the later load 1105. A special check hazard operation
1104 is issued in the later strand holding the load. A conditional arc 1114 is created from the
store to the check hazard operation. Thus if the load is issued before the store in the schedule
then the check hazard operation is issued. To enable this, the load and store must be in
different strands. The store is in the logically earlier strand 1107 and the load is in the logically
later strand 1108. The figure shows both data arcs 1110 and control arcs 1111.

The check hazard operation requires the addresses generated for the store and load
operations. Operation 1101 generates the address for the store. Operation 1103 generates the
address for the load via data arc 1115. The check hazard operation obtains those from the
same operations that generate them for the store and load. Although the check hazard is
issued in the subsequent strand it is able to use the address calculated for the store in the
earlier strand. Either the store or load strand can disable a check hazard. If the store address is
not valid then the operation is not performed.

The check hazard also has a dependency to the commit point 1106 for the loading strand, via
control arc 1112. This is required because the check hazard must be issued in the speculative
phase of the loading strand as it has the potential to abort the strand. A check is made to
ensure that the address generation for the load is not a dependee of any operations that must
be issued in the committed phase of the loading strand. This would not be legal as it would
make the graph cyclic, as the check hazard must be issued in the speculative phase.

The check hazard operation simply compares the load and store addresses. If they are not
equal then the operation has no effect. Thus if a load is issued earlier than a potentially aliased
store but the addresses are not actually aliased at run time then execution can continue
normally. If the addresses are identical then the strand to which the check hazard belongs (the
load strand) is aborted. The later store can then be performed. The region is then re-executed
and the check hazard will not be performed as the store address will be generated from a
disabled strand. On the re-execution the load obtains the correct data from the store
performed previously.
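
The run time behaviour of a boosted load guarded by a check hazard can be modelled in a few
lines. This is a behavioural sketch only: memory is a plain dictionary, a single store and load
pair is considered, and the function name and return convention are hypothetical.

```python
def run_region(memory, store_addr, store_value, load_addr):
    """Model one region with a load boosted ahead of a potentially aliased store.

    Returns (loaded_value, executions), where executions counts how many
    times the region had to run.
    """
    # First execution: the load is issued before the store (speculative phase).
    speculative_value = memory.get(load_addr, 0)

    # Check hazard: compare the two addresses, abort the load strand on a match.
    load_strand_aborted = (load_addr == store_addr)

    # The store, in the logically earlier strand, is performed regardless.
    memory[store_addr] = store_value

    if not load_strand_aborted:
        return speculative_value, 1

    # Re-execution: only the aborted load strand runs. The store strand is
    # disabled, so no check hazard is performed and the load sees the stored data.
    return memory.get(load_addr, 0), 2
```

When the addresses differ the speculative value is kept after a single execution. When they
match the region runs twice and the load returns the newly stored value.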

It is understood that there are many possible alternative embodiments of the invention. It is
recognized that the description contained herein is only one possible embodiment. This
should not be taken as a limitation of the scope of the invention. The scope should be defined
by the claims and we therefore assert as our invention all that comes within the scope and
spirit of those claims.



